Mastering Machine Learning with Spark 2.x by Tellez Alex & Pumperla Max & Malohlava Michal
Author: Tellez, Alex & Pumperla, Max & Malohlava, Michal
Language: eng
Format: azw3
Publisher: Packt Publishing
Published: 2017-08-31T04:00:00+00:00
Because we have already covered the preceding steps in Chapter 4, Predicting Movie Reviews Using NLP and Spark Streaming, we'll quickly reproduce them in this section.
As usual, we begin by starting the Spark shell, which is our working environment:
export SPARKLING_WATER_VERSION="2.1.12"
export SPARK_PACKAGES=\
"ai.h2o:sparkling-water-core_2.11:${SPARKLING_WATER_VERSION},\
ai.h2o:sparkling-water-repl_2.11:${SPARKLING_WATER_VERSION},\
ai.h2o:sparkling-water-ml_2.11:${SPARKLING_WATER_VERSION},\
com.packtpub:mastering-ml-w-spark-utils:1.0.0"
$SPARK_HOME/bin/spark-shell \
  --master 'local[*]' \
  --driver-memory 8g \
  --executor-memory 8g \
  --conf spark.executor.extraJavaOptions=-XX:MaxPermSize=384M \
  --conf spark.driver.extraJavaOptions=-XX:MaxPermSize=384M \
  --packages "$SPARK_PACKAGES" "$@"
In the prepared environment, we can directly load the data:
val DATASET_DIR = s"${sys.env.get("DATADIR").getOrElse("data")}/aclImdb/train"
val FILE_SELECTOR = "*.txt"
case class Review(label: Int, reviewText: String)
val positiveReviews = spark.read.textFile(s"$DATASET_DIR/pos/$FILE_SELECTOR")
  .map(line => Review(1, line)).toDF
val negativeReviews = spark.read.textFile(s"$DATASET_DIR/neg/$FILE_SELECTOR")
  .map(line => Review(0, line)).toDF
var movieReviews = positiveReviews.union(negativeReviews)
We can also define the tokenization function to split the reviews into tokens, removing all the common words:
import org.apache.spark.ml.feature.StopWordsRemover
val stopWords = StopWordsRemover.loadDefaultStopWords("english") ++ Array("ax", "arent", "re")
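The excerpt ends before the tokenization function itself is shown. As a rough illustration only, a tokenizer of this kind could be sketched as follows. The names MIN_TOKEN_LENGTH, toTokens, and sampleStopWords, the length threshold of 3, and the splitting rules are all assumptions made for this sketch, not the book's definition. The sketch uses plain Scala with no Spark dependency, so it can be pasted into the Spark shell or any Scala REPL.

```scala
// Hypothetical sketch of a tokenization function; names and thresholds
// are assumptions for illustration, not taken from the book.
val MIN_TOKEN_LENGTH = 3 // assumed: keep only tokens longer than 3 characters

def toTokens(minTokenLen: Int, stopWords: Set[String], review: String): Seq[String] =
  review.split("""\W+""").toSeq                                 // split on runs of non-word characters
    .map(_.toLowerCase.replaceAll("[^\\p{IsAlphabetic}]", ""))  // lowercase, drop non-alphabetic characters
    .filter(w => w.length > minTokenLen)                        // drop short tokens (and empty strings)
    .filterNot(w => stopWords.contains(w))                      // drop common words

// Small in-memory stop word set so the example runs without Spark.
val sampleStopWords = Set("this", "that", "with")
val tokens = toTokens(MIN_TOKEN_LENGTH, sampleStopWords,
  "This movie was GREAT, with 2 amazing actors!")
// tokens contains: movie, great, amazing, actors
```

In the Spark shell session above, the stop word list built from StopWordsRemover could be passed in as stopWords.toSet instead of the sample set. Taking the stop words as a parameter keeps the function easy to test in isolation.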